1 Introduction

In this R Markdown document, we explore the HCC1395 CNA predictions from OncoSNP and TITAN. The objective is to demonstrate some other analyses that can be performed on these results using R.

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. Inside of RStudio when you click the Knit button in the document, it will be generate an html includes both content as well as the output of any embedded R code chunks within the document.

Let’s start by loading the R packages that you will need for doing these analyses.

library("data.table")
library("ggplot2")
library("plyr")
library("dplyr")
library("stringr")
library("knitr")
library("reshape2")

2 OncoSNP Analysis

Let’s load the OncoSNP data.

# load the modified OncoSNP cnvs file
# this file has LRR and BAF values added for each segment
oncosnpDt <- fread("HCC1395.logR.baf.cnvs")
oncosnpDt <- oncosnpDt[, chr := factor(chr, levels = c(1:22, "X"))]

# summarize the tumour states more broadly
# this allows for easier visualization of the data
# as this is a cell-line, we treat all germline states as somatic
oncosnpDt <- oncosnpDt %>%
  mutate(state.modified = ifelse(state == 1, "HOMD", 
                          ifelse(state == 2, "HETD",
                          ifelse(state == 3, "NEU",
                          ifelse(state == 4, "3N_GAIN",
                          ifelse(state %in% c(5:13), "4-8N_GAIN",
                          ifelse(state %in% c(14:20), "LOH",
                          ifelse(state %in% c(21:28), "LOH", NA))))))))

# generate the length of each chromosome
chrLenDt <- oncosnpDt %>%
  mutate(segLen = end - start + 1) %>%
  group_by(chr) %>%
  summarize(chrLen = sum(segLen))

# set colors for each copy number state
states.col <- c("HOMD" = "#1F78B4",
                "HETD" = "#A6CEE3", 
                "NEU" = "lightgrey", 
                "3N_GAIN" = "#FB9A99", 
                "4-8N_GAIN" = "#E31A1C", 
                "LOH" = "#33A02C",
                "GERMLINE" = "#006837")

oncosnpDt.qc <- fread("HCC1395.qc")

# rename columns more easier processing downstream
setnames(oncosnpDt.qc, 
         c("Log-likelihood", "LogRRatioShift", "Copy Number (Average)"), 
         c("loglikelihood", "LRR.shift", "ploidy"))

rawProbeDt <- fread("HCC1395.ploidyConfig_1.rawProbe")
## 
Read 70.9% of 916285 rows
Read 916285 rows and 11 (of 11) columns from 0.052 GB file in 00:00:04
rawProbeDt <- rawProbeDt[, chr := factor(chr, levels = c(1:22, "X"))]
setkey(rawProbeDt, probeID)

#' Given a stateRankDf (from rankState1 to rankState5) and a maxrank, it will 
#' return the state with the highest rank. If the highest rank is NA, then 
#' it will go down the list of ranks until it finds a state that is not NA
#' 
#' @param 
summarizeProbeState <- function(stateRankDf, maxRank = 5) {
  states.summarized <- stateRankDf[, maxRank ]
  if (maxRank != 1){
    for ( i in (maxRank-1):1 ){
      states.summarized[is.na(states.summarized)] <- stateRankDf[is.na(states.summarized), i]
    }
  }
  states.summarized
}

#' This function is used to label the facet-grids

# Use this code for ggplot < 2.0.0
# facet_labeller <- function(var, value){
#     value <- as.character(value)
#     if (var=="cnMeasure") { 
#         value[value == "logRShifted"] <- "LRR"
#         value[value == "baf"] <- "BAF"
#     }
#     return(value)
# }

# Use this code for ggplot version > 2.0.0
facet_labeller <- labeller("logRShifted" = "LRR",
                           "baf" = "BAF")

stateRankDf <- rawProbeDt[, str_c('rankState', 1:5), with = FALSE] %>%
  data.frame
rawProbeDt <- cbind(rawProbeDt, finalState = summarizeProbeState(stateRankDf, maxRank = 5))
rawProbeDt <- rawProbeDt %>%
  mutate(finalState.modified = ifelse(finalState == 1, "HOMD", 
                               ifelse(finalState == 2, "HETD",
                               ifelse(finalState == 3, "NEU",
                               ifelse(finalState == 4, "3N_GAIN",
                               ifelse(finalState %in% c(5:13), "4-8N_GAIN",
                               ifelse(finalState %in% c(14:20), "LOH",
                               ifelse(finalState %in% c(21:28), "LOH", NA))))))))

rawProbeDt.melt <- rawProbeDt %>%
  melt(id.vars = c("probeID", "chr", "pos", "finalState.modified"), 
       measure.vars = c("logRShifted", "baf"), 
       variable.name = "cnMeasure")

2.1 Quality Control

Let’s first take a look at the quality control metrics from OncoSNP

kable(oncosnpDt.qc)
LRR.shift NormalContent ploidy loglikelihood OutlierRate LogRRatioStd BAlleleFreqStd PloidyNo
-0.186 0 2.9 390070.0 0 0.238 0.041 1
-0.171 0 2.9 389991.3 0 0.238 0.041 2

This table informs us on the statistics of the two OncoSNP runs. That is:

  1. One run initialized to diploid
  2. One run initialized to non-diploid

The row with the PloidyNo = 1 is the OncoSNP run that has the higher probability of being the correct one. The log-likelihood (i.e. probabilities) are actually quite similar, however the predicted ploidy is similar. This is an ideal scenario as both initialization coverged to similar solutions for the ploidy level.

2.2 Segment Analysis

Let’s first take a look how the copy number alteration segments distribute across the genome.

oncosnpDt %>%
  mutate(segLen = end - start + 1) %>%
  left_join(chrLenDt) %>%
  mutate(segChrProp = segLen / chrLen) %>%
  arrange(chr, state) %>%
  ggplot(aes(x = chr, y = segChrProp, fill = factor(state.modified))) +
  geom_bar(stat = "identity", position = "fill") +
  scale_fill_manual(name = "Tumour State", values = states.col) +
  xlab("Chromosome") +
  ylab("Proportion of Chromosome")
## Joining by: "chr"

OncoSNP by default will assign a copy number state (1 of 28) that a segment belongs to. In this lab, we’ve collapsed these states into 1 of 6 to simply visualization:

  1. 3N_GAIN: 3 copies
  2. 4-8N_GAIN: 4-8 copies
  3. HETD: Heterozygous Deletion
  4. HOMD: Homozygous Deletion
  5. LOH: Loss of Heterozygous
  6. NEU: Neutral

2.3 Chromosome Plots

Let’s take a closer took at some of the LRR and BAF plots for each chromosome. The benefit of using R here is we can add some more additional annotations such as colors to the tumour states. Here we plot just the first chromosome.

rawProbeDt.melt %>%
  filter(chr == 1) %>%
  ggplot(aes(pos, value, color = factor(finalState.modified))) + 
  geom_point(shape = 1) + 
  facet_grid(cnMeasure ~ ., scales = "free", labeller = facet_labeller) + 
  scale_color_manual(name = "Tumour State", values = states.col) +
  xlab("Position") + 
  ylab("")

Take note of the separation of the alleles in the BAF plot in LOH regions.

Do you understand the BAF signal in the “3N GAIN” region?

Feel free to change the chromosome and even plot more focal regions by setting the range on the pos column. For instance, here we plot a focal region on chromosome 1 (50 - 100MB):

rawProbeDt.melt %>%
  filter(chr == 1, pos > 50000000, pos < 100000000) %>% 
  ggplot(aes(pos, value, color = factor(finalState.modified))) + 
  geom_point(shape = 1) + 
  facet_grid(cnMeasure ~ ., scales = "free", labeller = facet_labeller) + 
  scale_color_manual(name = "Tumour State", values = states.col) +
  xlab("Position") + 
  ylab("")

Can you figure out how to visualize the chromosome plot of chromosome 19?

3 TITAN Analysis

Now let’s take a look at the TITAN results

titanDt <- fread("HCC1395_exome_tumour.results.segs.txt")
titanDt <- titanDt[, Chromosome := factor(Chromosome, levels = c(1:22, "X"))]

# generate the length of each chromosome
chrLenDt.titan <- titanDt %>%
  mutate(segLen = End_Position - Start_Position + 1) %>%
  group_by(Chromosome) %>%
  summarize(chrLen = sum(segLen))

# set colors for each copy number state
titan.states.col <- c("HOMD" = "#1F78B4",
                      "DLOH" = "#A6CEE3", 
                      "HET" = "lightgrey", 
                      "GAIN" = "#FB9A99", 
                      "BCNA" = "#E31A1C", 
                      "UBCNA" = "#E31A1C", 
                      "ASCNA" = "#E31A1C", 
                      "NLOH" = "#33A02C", 
                      "ALOH" = "#33A02C")

3.1 Segment Analysis

Just like OncoSNP, we have the ability to look at the copy number alteration distribution across the genome.

titanDt %>%
  mutate(segLen = End_Position - Start_Position + 1) %>%
  left_join(chrLenDt.titan) %>% 
  mutate(segChrProp = segLen / chrLen) %>% 
  arrange(Chromosome, TITAN_state) %>%
  ggplot(aes(x = Chromosome, y = segChrProp, fill = factor(TITAN_call))) +
  geom_bar(stat = "identity", position = "fill") +
  xlab("Chromosome") +
  ylab("Proportion of Chromosome") +
  scale_fill_manual(name = "Tumour State", values = titan.states.col)
## Joining by: "Chromosome"

You may notice that states have different state names between OncoSNP and TITAN. Each program typically has their own nomenclature on how states are defined. It’s important to understand the state names. For TITAN, this information can be found in the paper.

3.2 Chromosome Plots

Detailed chromosome plots are provided by TITAN already. For exome data, these plots will be much more sparse since we don’t have the same coverge as in genomes. We won’t reproduce these plots in R, but you can use the code above as a framework.

4 R Session


sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin11.4.2 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reshape2_1.4.1   knitr_1.13       stringr_1.0.0    dplyr_0.4.3     
## [5] plyr_1.8.3       ggplot2_2.1.0    data.table_1.9.6 nvimcom_0.9-14  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.2      magrittr_1.5     munsell_0.4.2    colorspace_1.2-6
##  [5] R6_2.1.1         highr_0.5.1      tools_3.2.2      parallel_3.2.2  
##  [9] grid_3.2.2       gtable_0.1.2     DBI_0.3.1        htmltools_0.3   
## [13] lazyeval_0.1.10  yaml_2.1.13      digest_0.6.9     assertthat_0.1  
## [17] formatR_1.2.1    evaluate_0.8     rmarkdown_0.9.6  labeling_0.3    
## [21] stringi_1.0-1    scales_0.3.0     chron_2.3-47